ABSTRACT. We introduce the software tool NTRFinder to find the complex repetitive structure in DNA we call a nested
tandem repeat (NTR). An NTR is a recurrence of two or more distinct tandem motifs interspersed with each other. We propose 
that nested tandem repeats can be used as phylogenetic and population markers. 

We have tested our algorithm on both real and simulated data, and present some real nested tandem repeats of interest.
We discuss how the NTR found in the ribosomal DNA of taro (Colocasia esculenta) may assist in determining the cultivation
prehistory of this ancient staple food crop.

NTRFinder can be downloaded from http://www.maths.otago.ac.nz/~aamatroud/.
1. Introduction

Genomic DNA has long been known to contain tandem repeats: repetitive structures in which many approximate
copies of a common segment (the motif) appear consecutively. Several studies have proposed different mechanisms for
the occurrence of tandem repeats [Weitzmann et al., 1997; Wells, 1996], but their biological role is not well understood.
Recently we have observed a more complex repetitive structure in the ribosomal DNA of Colocasia esculenta (taro), 
consisting of multiple approximate copies of two distinct motifs interspersed with one another. We call such structures 
nested tandem repeats (NTRs), and the problem of finding them in sequence data is the focus of this paper. Our motivation 
is their potential use for studying populations: for example, a preliminary analysis suggests that changes in the NTR in 
taro have been occurring on a 1,000 year time scale, so a greater understanding of this NTR offers the potential to date 
the early agriculture of this ancient staple food crop. 

The problem of locating tandem repeats is well known: their implication in neurological disorders [Macdonald et al., 1993;
Fu et al., 1992] and their use in inferring evolutionary histories have prompted researchers to develop tools to find them. This
has resulted in a number of software tools, each of which has its own strengths and limitations. However, the existing tools
were not designed to find NTRs, and consequently do not generally find them. In this paper, we present a new software
tool, NTRFinder, which is designed to find these more complex repetitive structures.

We report here the algorithm on which NTRFinder is based and report some of the NTRs it has identified, including 
an even more complex structure where copies of four distinct motifs are interspersed. 
2. Background and Definitions

2.1. Sequences, edit operations and the edit distance. A DNA sequence is a sequence of symbols from the nucleotide
alphabet $\Sigma = \{A, C, G, T\}$. We define a DNA segment to be a string of contiguous DNA nucleotides and define a site to be
a component in a segment. For a DNA segment

$X = x_1 x_2 \cdots x_n$,

$x_i \in \Sigma$ is the nucleotide at the i-th site and $|X| = n$ is the length of X.

Copying errors happen in DNA due to different external and internal factors. These changes include substitution, 
insertion, deletion, duplication, and contraction. We refer to these as edit operations. By giving each type of edit operation 
some specific weight, we can in principle find a series of edit operations which transform segment x to segment y, whose 
sum of weights is minimal. We will refer to this sum as the edit distance, and denote it by d(x, y). For the purposes of this 
paper, the edit operations allowed in calculating the edit distance are single nucleotide substitutions, and single nucleotide 
insertions or deletions (indels), with each given weight 1.
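
The unit-weight edit distance defined above can be computed with the standard dynamic programming recurrence. The following is a minimal sketch, not code taken from NTRFinder; the function name `edit_distance` is our own choice for illustration.

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance d(x, y).

    Allowed operations, each of weight 1: single nucleotide substitution,
    single nucleotide insertion, single nucleotide deletion.
    """
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]  # distance from x[:i] to the empty prefix of y
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from x
                curr[j - 1] + 1,         # insert b into x
                prev[j - 1] + (a != b),  # match or substitute
            ))
        prev = curr
    return prev[-1]
```

For instance, `edit_distance("AGG", "ATG")` is 1 (one substitution), while `edit_distance("CTTCG", "CTCAG")` is 2. Only one row of the table is kept at a time, so the memory use is linear in the length of `y`.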

2.2. Classification of Tandem Repeats. Many classifications of tandem repeat schemas have been introduced in the 
computational biology literature. We list some which are commonly used: 

• (Exact) Tandem Repeats: An exact tandem repeat (TR) is a sequence comprising two or more contiguous copies
$XX \cdots X$ of identical segments X (referred to as the motif).

• k-Approximate Tandem Repeats: A k-approximate tandem repeat (k-TR) is a sequence comprising two or
more contiguous copies $X_1 X_2 \cdots X_n$ of similar segments, where each individual segment $X_i$ is at edit distance at
most k from a template segment X.

• Multiple Length Tandem Repeats: A multiple length tandem repeat (MLTR) is a tandem repeat where each repeat copy
is of the form $X x^n$, where n is a constant larger than one and $d(X, x)$ is greater than some threshold value k.

Examples: 

• TR: 

AGG AGG AGG AGG AGG. The motif is AGG. 

• 1-TR:

AGG AGC ATG AGG CGG. The motif is AGG. 

• MLTR: 

GACCTTTGG ACGGT ACGGT ACGGT 
GACCTTTGG ACGGT ACGGT ACGGT. 
The motifs are ACGGT and GACCTTTGG, with n = 3. 

2.3. Nested Tandem Repeats. In this section we introduce a more complex repetitive structure, the nested tandem repeat
(NTR), also referred to as a variable length tandem repeat [Hauth and Joseph, 2002]. Let X and x be two segments
(typically of different lengths) over the alphabet $\Sigma = \{A, C, G, T\}$, such that $d(X, x)$ is greater than some threshold value
k.

Definition 1. An exact nested tandem repeat is a string of the form

$x^{s_0} X x^{s_1} X \cdots X x^{s_n}$,

where $n > 1$, $s_i \geq 1$ for each $0 < i < n$, and $s_j \geq 2$ for some $j \in \{0, 1, \ldots, n\}$. The motif x is called the tandem repeat
and the motif X is the interspersed repeat. The concatenations of the tandem repeats $x^{s_i}$ alone, and of the interspersed
motifs X alone, both form exact tandem repeats.

Example: x = ACGGT, X = GACCTTTGG, $n = 7$, $s_0 = 0$, $s_1 = 3$, $s_2 = 5$, $s_3 = 2$, $s_4 = 4$, $s_5 = 1$, $s_6 = s_7 = 2$, so

$x^{s_0} \prod_{i=1}^{7} X x^{s_i}$ = XxxxXxxxxxXxxXxxxxXxXxxXxx

=GACCTTTGG ACGGT ACGGT ACGGT 
GACCTTTGG ACGGT ACGGT ACGGT ACGGT ACGGT 
GACCTTTGG ACGGT ACGGT 
GACCTTTGG ACGGT ACGGT ACGGT ACGGT 
GACCTTTGG ACGGT 
GACCTTTGG ACGGT ACGGT 
GACCTTTGG ACGGT ACGGT. 
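
The construction in Definition 1 is easy to reproduce programmatically. The sketch below builds the string $x^{s_0} X x^{s_1} X \cdots X x^{s_n}$ from the two motifs and the copy numbers, and checks the conditions on the copy numbers, reading them as $s_i \geq 1$ for interior i and $s_j \geq 2$ for some j (the reading that the worked example requires, since $s_5 = 1$). The function names are hypothetical and chosen for illustration only.

```python
def build_exact_ntr(x: str, X: str, s: list[int]) -> str:
    """Return x^{s_0} X x^{s_1} X ... X x^{s_n} for copy numbers s = [s_0, ..., s_n]."""
    if len(s) < 2:
        raise ValueError("need at least s_0 and s_1")
    # s_0 copies of x, then one block "X followed by s_i copies of x" per i >= 1.
    return x * s[0] + "".join(X + x * s_i for s_i in s[1:])


def satisfies_definition_1(s: list[int]) -> bool:
    """Check the copy-number conditions of Definition 1 (as read above)."""
    n = len(s) - 1
    return (
        n > 1
        and all(s_i >= 1 for s_i in s[1:n])  # interior copy numbers, 0 < i < n
        and any(s_j >= 2 for s_j in s)       # at least one genuine tandem run
    )


# Copy numbers from the example: s_0 = 0, s_1 = 3, ..., s_7 = 2.
copy_numbers = [0, 3, 5, 2, 4, 1, 2, 2]
ntr = build_exact_ntr("ACGGT", "GACCTTTGG", copy_numbers)
```

Calling `build_exact_ntr("x", "X", copy_numbers)` with single-letter placeholders returns the symbolic string `XxxxXxxxxxXxxXxxxxXxXxxXxx` given in the example.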

In practice we expect any nested tandem repeats occurring in DNA sequences to be approximate rather than exact. In
what follows we will write $\tilde{X}$ to mean an approximate copy of the motif X, and $\tilde{x}^{s}$ to mean an approximate tandem repeat
consisting of s approximate copies of the motif x.

Definition 2. A $(k_1, k_2)$-approximate nested tandem repeat is a string of the form

$\tilde{x}^{s_0} \tilde{X} \tilde{x}^{s_1} \tilde{X} \cdots \tilde{X} \tilde{x}^{s_n}$,

where n and $s_i$ satisfy the same conditions as in Definition 1, $\tilde{x}^{s_0} \tilde{x}^{s_1} \cdots \tilde{x}^{s_n}$ is a $k_1$-approximate tandem repeat with
motif x, and $\tilde{X} \tilde{X} \cdots \tilde{X}$ is a $k_2$-approximate tandem repeat with motif X.

Examples: 

• NTR: 

AGG AGG CTCAG AGG CTCAG AGG AGG AGG CTCAG. 
The motifs are AGG, CTCAG. 

• (1,2)-NTR: 

AGA AGG CTTCG AGG CTCAG AA AGA AGG CTTCG AGG CTCAG AAG.

The motifs are AGG, CTCAG. 
3. Related Work

Various algorithms have been introduced to find exact tandem repeats. Such algorithms were developed mainly
for theoretical purposes, namely, to solve the problem of finding squares in strings [Apostolico and Preparata, 1983;
Crochemore, 1981; Kolpakov et al., 2001; Main and Lorentz, 1984; Stoye and Gusfield, 2002]. These algorithms are not
easily adapted to finding the approximate tandem repeats that usually occur in DNA.

A number of algorithms [Delgrange and Rivals, 2004; Landau et al., 2001] consider motifs differing only by
substitutions, using the Hamming distance as a measure of similarity. Others, e.g. [Benson, 1999; Hauth and Joseph, 2002;
Domanig and Preparata, 2007; Sagot and Myers, 1998; Wexler et al., 2005], have considered insertions and deletions by
using the edit distance. Most of these algorithms have two phases: a scanning phase that locates candidate tandem repeats,
and an analysis phase that checks the candidate tandem repeats found during the scanning phase.

The only algorithm designed to look for NTRs is that of Hauth and Joseph (2002), which searches for tandem motifs 
of length at most six nucleotides. 



4. The Algorithm 

In this section we present the algorithm we have developed to search for nested tandem repeats in a DNA sequence.
The algorithm requires several preset parameters. These are: $k_1$ and $k_2$, which bound the edit distances from the tandem
and interspersed motifs; and the motif length bounds $\min_{t1}$, $\max_{t1}$, $\min_{t2}$, $\max_{t2}$. Other input parameters are discussed
below.

Search phase. Our search is confined to seeking NTRs with motifs of length $l_1 \in [\min_{t1}, \max_{t1}]$ and $l_2 \in [\min_{t2}, \max_{t2}]$.
A $(k_1, k_2)$-NTR must contain a $k_1$-TR, so we begin by scanning the sequence for approximate tandem repeats. Several
good algorithms, including those of Benson (1999), Wexler et al. (2005) and Domanig and Preparata (2007), have been
developed to find $k_1$-TRs. We have chosen to adapt the algorithm of Wexler et al. (2005), where the sequence is scanned
by two windows $w_1$, $w_2$ of width w, a distance $l_1$ apart. Wexler's algorithm uses a similarity parameter q with default
value q = 0.5, which can be reset by the user. The user may set the $k_1$, $k_2$ values, which are preset with the default values

$k_1 = l_1(1 - p_m) + \sqrt{l_1(1 - p_m)p_m}$

$k_2 = l_2(1 - p_m) + \sqrt{l_2(1 - p_m)p_m}$,

following Domanig and Preparata (2007), with matching probability $p_m$ given the default value $p_m = 0.8$.
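
As a small numerical illustration of the default thresholds, the sketch below evaluates the expression $l(1 - p_m) + \sqrt{l(1 - p_m)p_m}$, that is, the expected number of mismatching sites in a motif of length l plus one binomial standard deviation. This is our reading of the formula; the function name `default_threshold` is hypothetical, and whether NTRFinder rounds the value is not stated in the text, so no rounding is applied here.

```python
import math


def default_threshold(motif_length: int, p_m: float = 0.8) -> float:
    """Default edit-distance bound for a motif of the given length.

    Assumption: the bound is l(1 - p_m) + sqrt(l (1 - p_m) p_m), where p_m is
    the matching probability. No rounding is applied.
    """
    expected_mismatches = motif_length * (1 - p_m)
    spread = math.sqrt(motif_length * (1 - p_m) * p_m)
    return expected_mismatches + spread


# Example: thresholds for a tandem motif of length 10 and an
# interspersed motif of length 50, both with the default p_m = 0.8.
k1 = default_threshold(10)
k2 = default_threshold(50)
```

With $p_m = 0.8$, a motif of length 10 gives a bound of roughly 3.26 and a motif of length 50 a bound of roughly 12.83, so longer motifs tolerate more edits in absolute terms but fewer per site.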

Once a TR has been found and its full extent determined, the right-most copy of the repeated pattern is taken as the 
current TR motif x, and further approximate copies of x are sought, displaced from the TR up to a distance of max t2 
nucleotides to the right. If no further approximate copies of x are located, this TR is abandoned, and the TR search 
continues to the right. If a displaced approximate copy of x is observed, then both x and the interspersed segment X are 
recorded in a list, as we have found a candidate NTR. Further contiguous copies of x are then sought, with the rightmost 
copy x replacing the previous template motif. 

The steps above are repeated with successive motifs x and interspersed segments copied to the list, until no additional 
copies of the last recorded motif x are found. This search phase is illustrated in Figure 1.

At this point the algorithm builds consensus patterns for x and X using majority rule. After constructing the two 
consensus patterns the algorithm moves to the verification phase. 
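
A majority-rule consensus can be sketched as a column-wise vote over the recorded copies. The version below assumes the copies have already been brought to equal length (for example by an alignment step); how NTRFinder treats indels and ties is not described in the text, so both the equal-length restriction and the tie-breaking rule (first nucleotide encountered wins) are assumptions of this illustration.

```python
from collections import Counter


def majority_consensus(copies: list[str]) -> str:
    """Column-wise majority-rule consensus of equal-length motif copies.

    Assumption: all copies have the same length, so column i of every copy
    corresponds to the same motif site. Ties go to the nucleotide seen first.
    """
    if not copies:
        raise ValueError("need at least one copy")
    length = len(copies[0])
    if any(len(copy) != length for copy in copies):
        raise ValueError("copies must have equal length in this sketch")
    # zip(*copies) yields one tuple of nucleotides per motif site.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*copies))


# The five copies from the 1-TR example of Section 2.2.
consensus = majority_consensus(["AGG", "AGC", "ATG", "AGG", "CGG"])
```

Applied to the copies of the 1-TR example, each column has a clear majority and the consensus is the motif AGG.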

Example: An example will help illustrate the procedure. Suppose that S contains an NTR of the form

$x X_0 xxx X_1 xxxxxx X_2 xx X_3$.

The algorithm will scan from the left until it locates the tandem repeat consisting of three copies of x between $X_0$ and
$X_1$. It will then start searching for additional non-adjacent copies of x to the right, locating the first copy to the right of
$X_1$. Having found this it will record the intervening segment $X_1$, and then continue the tandem repeat search from this
point until the full extent of the tandem repeat between $X_1$ and $X_2$ is found.

This procedure is repeated once more, locating the tandem repeat between $X_2$ and $X_3$, recording the segment $X_2$, and
then searching for further copies to the right. At this point no more copies of x are found, and the process of verification
begins. The segments $X_0$, $X_3$ and the initial copy of x are found during the verification stage.
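
The rightward extension step of the search phase can be sketched as follows. This is a deliberately simplified illustration, not the NTRFinder implementation: it uses exact matching in place of approximate matching with the $k_1$ bound, keeps the template motif fixed instead of replacing it with the rightmost copy, and assumes the initial tandem repeat has already been located. The function name `extend_ntr` and its parameters are hypothetical.

```python
def extend_ntr(seq: str, x: str, tr_end: int, max_t2: int) -> tuple[list[str], int]:
    """Walk right from the end of a tandem repeat of x, collecting interspersed segments.

    Simplifications (assumptions of this sketch): copies of x are matched
    exactly, and the template motif x is never updated.
    Returns the recorded interspersed segments and the index just past the
    last copy of x that was found.
    """
    interspersed = []
    pos = tr_end
    # Make sure pos sits at the right end of the current run of copies.
    while seq.startswith(x, pos):
        pos += len(x)
    while True:
        # Look for a copy of x displaced by 1..max_t2 nucleotides to the right.
        nxt = seq.find(x, pos + 1, pos + max_t2 + len(x))
        if nxt == -1:
            break  # no displaced copy: the candidate NTR ends here
        interspersed.append(seq[pos:nxt])  # record the intervening segment
        pos = nxt + len(x)
        # Absorb further contiguous copies of x.
        while seq.startswith(x, pos):
            pos += len(x)
    return interspersed, pos


# The structure x X0 xxx X1 xxxxxx X2 xx X3 with made-up interspersed segments.
x = "ACGGT"
X0, X1, X2, X3 = "GACCTTTGG", "GACCTTAGG", "GACGTTTGG", "GTCCTTTGG"
S = x + X0 + x * 3 + X1 + x * 6 + X2 + x * 2 + X3
tr_end = len(x + X0 + x * 3)  # right end of the first tandem repeat found
segments, end = extend_ntr(S, x, tr_end, max_t2=20)
```

On this input the sketch records the segments $X_1$ and $X_2$ and stops just before $X_3$, mirroring the example: $X_0$, $X_3$ and the initial copy of x are left for the verification phase.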

Verification phase: Each candidate NTR is checked to determine whether it meets the NTR definition. This is
accomplished by aligning the candidate NTR region, together with a margin on either side of it, against the consensus motifs
x and X, using the nested wrap-around dynamic programming algorithm of Matroud et al. (2010). This has complexity
$O(n|x||X|)$, where n is the length of the NTR region and |x| and |X| are the lengths of the tandem motif and the
interspersed motif respectively.
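
The nested algorithm of Matroud et al. (2010) is not reproduced here. As a simpler illustration of the underlying idea, the sketch below implements ordinary single-motif wrap-around dynamic programming: it computes the minimum unit-cost edit distance between a text and any substring of the infinite repetition of one motif, in O(nm) time for text length n and motif length m. The function name `wraparound_distance` is hypothetical, and the nested version additionally has to handle the interspersed motif X.

```python
def wraparound_distance(text: str, motif: str) -> int:
    """Minimum edit distance between text and a substring of motif repeated forever.

    Single-motif wrap-around DP (an illustration only, not the nested
    algorithm used in the verification phase). prev[j] is the cheapest
    alignment of the text read so far that ends just after motif site j.
    """
    m = len(motif)
    if m == 0:
        raise ValueError("motif must be non-empty")
    prev = [0] * m  # the alignment may start at any phase of the motif
    for ch in text:
        curr = [0] * m
        for j in range(m):
            diag = prev[j - 1] + (ch != motif[j])  # prev[-1] wraps to the last site
            up = prev[j] + 1                       # ch is an extra nucleotide
            curr[j] = min(diag, up)
        # Skipped motif sites: the dependency is cyclic within the row,
        # so two left-to-right sweeps are needed to close the wrap-around.
        for _ in range(2):
            for j in range(m):
                curr[j] = min(curr[j], curr[j - 1] + 1)
        prev = curr
    return min(prev)
```

For the 1-TR example of Section 2.2, `wraparound_distance("AGGAGCATGAGGCGG", "AGG")` is 3, one for each of the three substituted sites, and a text that starts in the middle of a motif copy, such as `"GGAGGAG"`, has distance 0.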
FIGURE 1. Flowchart of the NTRFinder algorithm.



5. Results 

5.1. Tests on real sequence data. We have implemented our algorithm and carried out searches for NTRs on some DNA
sequences taken from GenBank. The size ranges used for this search were $[\min_{t1}, \max_{t1}] = [\min_{t2}, \max_{t2}] = [2, 100]$,
with the parameters $k_1$, $k_2$ and q left set to their default values. Some NTR regions found by our software are listed in
Table 1.

5.2. More complex structures. In addition to the nested tandem repeats in Table 1, NTRFinder also reported an NTR
in Linum usitatissimum (accession number gi|164684852|gb|EU307117.1|) which on further analysis by hand
turned out to have a more complex structure. The IGS region of the rDNA of this species contains an NTR with four
motifs interspersed with each other. The four motifs are w = GTGCGAAAAT, x = GCGCGCCAGGG, y = GCACCCATAT, and
z = GCGATTTTG, and the structure of the NTR is a product $\prod_{i=1}^{25}$ of blocks built from the four motifs, with copy-number exponents

$q_i \in \{1, 2, 3\}$; $r_i \in \{1, 2\}$; $s_i \in \{0, 1\}$; $t_i \in \{0, 1\}$.

| Species | Accession number | Tandem motif x | Interspersed motif X | Length of x | Length of X | Start index | End index | #x | #X |
|---|---|---|---|---|---|---|---|---|---|
| N. sylvestris | [illegible] | AGGACATGGC | [illegible] | 10 | [illegible] | 960 | [illegible] | 53 | [illegible] |
| B. juncea | [illegible] | GGACGTCCCGCGTGCACAGAC | [illegible] | 21 | [illegible] | 1,403 | [illegible] | 51 | [illegible] |
| B. oleracea | [illegible] | GGACAGTCCTCGTGGGCGAAAATCACCCAC | [illegible] | 30 | [illegible] | 1,256 | [illegible] | 32 | [illegible] |
| B. rapa | [illegible] | GGATCAGTACAC | [illegible] | 12 | [illegible] | 385 | [illegible] | 20 | [illegible] |
| B. campestris | [illegible] | GGACGTCCTTTGTGTGCTGAC | [illegible] | 21 | [illegible] | 1,558 | [illegible] | 37 | [illegible] |
| C. esculenta | Not published | TCGCACAGCCG | [illegible] | 11 | [illegible] | 725 | [illegible] | 94 | [illegible] |
| D. melanogaster | [illegible] | TGCCCCAGT | [illegible] | 9 | [illegible] | 4,215,779 | [illegible] | 7 | [illegible] |
| H. sapiens X chromosome | AL672277.21 | CT | CACAAGGAGCTGCTCTCCTCCTTTCTTCTGTTGAGACGTGTGTGTGTCTGTCTTT | 2 | 55 | 35,471 | 35,711 | 360 | 8 |
| H. sapiens X chromosome | AL683871.15 | GATA | TGATGGTAATAGATACATACTTAGGTA | 4 | 27 | 111,705 | 113,805 | 147 | 56 |

TABLE 1. Nested tandem repeats found in some sequences from GenBank and an additional unpublished sequence (C. esculenta).

5.3. Running time. The running time for NTRFinder searching some sequences from GenBank is shown in Figure 2.
It can be seen that the run time is approximately linear in the length of the sequence. However, it must be noted that
the run time depends not only on the length of the input sequence, but also on the number of tandem and nested tandem
repeats found in the sequence. The program spends most of the time verifying any tandem repeats found.

6. Discussion 

In the last decade a number of software tools to find tandem repeats have been introduced; however, little work exists 
on more complex repetitive structures such as nested tandem repeats. The problem of finding nested tandem repeats is 
addressed in this study. The motivation for our study is the potential use of NTRs as a marker for genetic studies of 
populations and of species. 

We have done some analysis on the nested tandem repeat in the intergenic spacer region in C. esculenta (taro), noting
some variation in the NTRs derived from domesticated varieties sourced from New Zealand, Australia and Japan. Further
varieties are currently being analysed. By considering edit operations such as deletion, mutation, and duplication
we can align the nested tandem repeat regions of each pair of sequences. The alignment score can then be considered as
a measure of distance between the two sequences. In particular, the sequences appear to share some common inferred histories of the
development of the NTRs from a simpler structure of two motifs. The edit operations appear to be occurring on a 1,000
year timescale, so this analysis offers the potential to date the prehistory of the early agriculture of this ancient staple food
crop.

FIGURE 2. Running time of NTRFinder (on a Pentium Dual core T4300 2.1 GHz) plotted against
segment length on a log-log scale. The search was performed on segments of different lengths, with the
minimum and maximum tandem repeat lengths set to 8 and 50 respectively. The distribution suggests
the running time is approximately linear with sequence length.

7. Conclusion 

The nested tandem repeat structure is a complex structure that requires further analysis and study. The number of copy
variants in the NTR region and the relationships between these copies might suggest a tandem repeat generation
mechanism. In this paper, we have introduced a new algorithm to find nested tandem repeats. The first phase of the algorithm
has $O(n \max_{t1} \max_{t2})$ time complexity, while the second phase (the alignment) needs $O(n \max_{t1} \max_{t2})$ space
and time, where n is the length of the NTR region, and $\max_{t1}$, $\max_{t2}$ are the maximum allowed lengths of the tandem
and interspersed motifs.